
[metrics] add block analyzer queuelength metrics #577

Merged
merged 1 commit into main from andrew7234/block-analyzer-queuelength on Jan 17, 2024

Conversation

Andrew7234
Collaborator

@Andrew7234 Andrew7234 commented Nov 30, 2023

Task:

Expose a 'queueLength'-equivalent metric in the block analyzers

For more context, see https://github.com/oasisprotocol/nexus/pull/511/files#r1318043060

It'd be nice to report the difference between the indexed block height and the node block height in Prometheus. We already have both heights stored by the node_stats analyzer.

This PR

To enable the queuelength metric for runtime block analyzers, nexus needs the current chain heights of the runtimes. The node_stats analyzer currently fetches/stores this data only for consensus, so this PR also adds runtime support for the node_stats analyzer.

Note: We could avoid this by fetching the chain heights directly in the block analyzers. Although the height would be fresher, it would also require a second round-trip in the main block processing loop. Since the metrics are internal-only, I opted to use the chain heights we already store in chain.latest_node_heights. Open to suggestions, though.

Alternatively, we could update the metric in a concurrent loop separate from the main block processing loop. I'm not sure that's worth the extra complexity, since slow-sync is, by name, expected to be slower.
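
For illustration, a minimal sketch of the gauge wiring using the Prometheus Go client; newQueueLengthGauge and updateQueueLength are placeholder names, not the exact code in this PR:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newQueueLengthGauge registers a gauge named like the output below,
// e.g. "emerald_queue_length". (Hypothetical helper for illustration.)
func newQueueLengthGauge(analyzerName string) prometheus.Gauge {
	return promauto.NewGauge(prometheus.GaugeOpts{
		Name: analyzerName + "_queue_length",
		Help: "number of blocks left to process for the " + analyzerName + " analyzer",
	})
}

// updateQueueLength sets the gauge from the node height stored by the
// node_stats analyzer (chain.latest_node_heights) and the analyzer's own
// last processed height; both are assumed to be fetched elsewhere.
func updateQueueLength(g prometheus.Gauge, latestNodeHeight, lastProcessedHeight int64) {
	if latestNodeHeight <= lastProcessedHeight {
		g.Set(0)
		return
	}
	g.Set(float64(latestNodeHeight - lastProcessedHeight))
}
```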

~/oasis/oasis-block-indexer(andrew7234/block-analyzer-queuelength*) » curl localhost:8009/metrics | grep queue_length                 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43060    0 43060    0     0  7934k      0 --:--:-- --:--:-- --:--:-- 41.0M
# HELP consensus_queue_length number of blocks left to process for the consensus analyzer
# TYPE consensus_queue_length gauge
consensus_queue_length 0 
^ This reads 0 because I couldn't run the consensus analyzer locally; the default value of a Prometheus gauge is 0, which is what we see here.

# HELP emerald_queue_length number of blocks left to process for the emerald analyzer
# TYPE emerald_queue_length gauge
emerald_queue_length 2.392834e+06
# HELP sapphire_queue_length number of blocks left to process for the sapphire analyzer
# TYPE sapphire_queue_length gauge
sapphire_queue_length 3.726054e+06

@Andrew7234 Andrew7234 changed the base branch from main to mitjat/3phase-timings November 30, 2023 22:03
analyzer/block/block.go (outdated review thread; resolved)
@Andrew7234 Andrew7234 force-pushed the andrew7234/block-analyzer-queuelength branch from 34865fa to 83c67c6 Compare December 1, 2023 23:04
@Andrew7234 Andrew7234 changed the title Andrew7234/block analyzer queuelength [metrics] add block analyzer queuelength metrics Dec 1, 2023
@Andrew7234 Andrew7234 marked this pull request as ready for review December 4, 2023 13:08
Contributor

@mitjat mitjat left a comment

Thanks Andy!

analyzer/node_stats/node_stats.go (outdated review thread; resolved)
analyzer/block/block.go (3 outdated review threads; resolved)
if err1 != nil {
	return nil, err1
}
return nodestats.NewAnalyzer(cfg.Analyzers.NodeStats.ItemBasedAnalyzerConfig, sourceClient, emeraldClient, sapphireClient, dbClient, logger)
Contributor

Thank you for expanding the node stats analyzer!

It's not great that we're hardcoding the dependency on all the runtimes here. Imagine Sapphire becomes unavailable and we still want stats for consensus, or we want to test node_stats locally or in (nonexistent :/) tests but don't want to go through the hassle of setting up access to all the runtimes.

Three options I can think of:

  • Add a config flag (list) for specifying which layers to include; this is probably the most "proper" solution, and quite usable too if the default is [consensus, sapphire, emerald]. See the sketch after this comment.
  • Decide which layers to include based on the presence/absence of their block analyzers. That's very convenient but leads to implicit dependencies between analyzers / sections of config, which is super ugly.
  • Leave as-is. If e.g. Sapphire becomes unavailable, things will continue to work (thanks to lazy initialization of node clients in Connect to oasis-node lazily #555); we'll just see a good amount of error spam in the logs.

Your call on whether to go with 1 or 3; I think both are justifiable. I'm partly writing them out to check whether option 3 holds the way I've described it.
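
For concreteness, a minimal sketch of what option 1 could look like; the Layers field, the yaml tag, and the defaulting helper are just a sketch, not necessarily the final shape:

```go
package config

// NodeStatsConfig configures the node_stats analyzer (sketch only).
type NodeStatsConfig struct {
	// Layers selects which layers the analyzer queries for node heights.
	// An empty list falls back to the proposed default.
	Layers []string `yaml:"layers"`
}

// LayersOrDefault applies the default suggested in option 1.
func (c NodeStatsConfig) LayersOrDefault() []string {
	if len(c.Layers) == 0 {
		return []string{"consensus", "sapphire", "emerald"}
	}
	return c.Layers
}
```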

Collaborator Author

Went with option 1!

Some notes for future reference:
The code here creates the consensus/runtime clients for all layers, even those not specified in the config file. The LazyGrpcConnect function mentioned above lets us connect to oasis-node only when needed. However, when we instantiate a new runtime client we also make a connections.SDKConnect call a few lines above here, which looks like it could error. The underlying connection code only establishes the connection and checks the chainContext; it does not fetch any runtime-specific info. In almost all cases, the default RPC node specified in the config file will pass this check, so the only failure case should be a runtime node that is explicitly specified and down. In that case the failure is immediate and obvious, and can be worked around by either a) restoring the node or b) removing the problematic layer from the node-stats config list.
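
A rough sketch of the setup logic described above; RuntimeClient, dialRuntime, and clientsForLayers are placeholder names, and the real code goes through connections.SDKConnect and the lazy gRPC dial from #555:

```go
package nodestats

import (
	"context"
	"fmt"
)

// RuntimeClient is a stand-in for the real runtime client type.
type RuntimeClient interface{}

// dialRuntime stands in for the lazy connection setup; it only fails fast
// when an explicitly configured runtime node is unreachable.
func dialRuntime(ctx context.Context, layer string) (RuntimeClient, error) {
	return struct{}{}, nil
}

// clientsForLayers builds runtime clients only for the layers listed in the
// node-stats config; consensus is served by the existing consensus client.
func clientsForLayers(ctx context.Context, layers []string) (map[string]RuntimeClient, error) {
	clients := make(map[string]RuntimeClient, len(layers))
	for _, layer := range layers {
		if layer == "consensus" {
			continue
		}
		c, err := dialRuntime(ctx, layer)
		if err != nil {
			// Per the notes above: restore the node or drop the layer
			// from the node-stats config list.
			return nil, fmt.Errorf("node_stats: connecting to %s: %w", layer, err)
		}
		clients[layer] = c
	}
	return clients, nil
}
```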

@Andrew7234 Andrew7234 force-pushed the andrew7234/block-analyzer-queuelength branch from 83c67c6 to 81ff26a Compare December 14, 2023 06:00
@mitjat mitjat force-pushed the mitjat/3phase-timings branch 2 times, most recently from 3a832f7 to 816e08a Compare December 23, 2023 00:18
Base automatically changed from mitjat/3phase-timings to main December 23, 2023 00:23
Contributor

@mitjat mitjat left a comment

Thank you! There are one or two not entirely trivial comment threads still unresolved, but I expect smooth sailing from here on out. I'm LGTM'ing now since the holidays will make it harder to sync.

It looks like you might also have to be careful with the rebase now that #572 is merged? Hopefully not 🤞

analyzer/block/block.go (2 outdated review threads; resolved)
analyzer/consensus/consensus.go (outdated review thread; resolved)
@Andrew7234 Andrew7234 force-pushed the andrew7234/block-analyzer-queuelength branch 2 times, most recently from 48e2d9d to 15b3db5 Compare January 4, 2024 17:47
@Andrew7234 Andrew7234 force-pushed the andrew7234/block-analyzer-queuelength branch 2 times, most recently from 48e50fc to d83c474 Compare January 12, 2024 19:41
nit

enable node stats analyzer for runtimes

wip

tweaks

nit

misc

address comments

address comments

nit

fix tests/nits

lint
@Andrew7234 Andrew7234 force-pushed the andrew7234/block-analyzer-queuelength branch from d83c474 to cf37b92 Compare January 17, 2024 21:27
@Andrew7234 Andrew7234 merged commit fc099d7 into main Jan 17, 2024
6 checks passed
@Andrew7234 Andrew7234 deleted the andrew7234/block-analyzer-queuelength branch January 17, 2024 21:37